Given the recent talk about new Conservative leadership candidates and their potential to hold the 2019 Conservative coalition together, I've been thinking about voters. Specifically, I've been thinking about types of voters. Lots of work focuses on classifying voters by their likelihood to vote, for the purposes of campaigning. But what about identifying groups of voters using more than just their propensity to vote?
Voter Data
The data we'll be using is the latest wave of the British Election Study (BES) panel study. This data contains a number of interesting fields, including the left-right alignment of voters. Below I've provided a plot showcasing the left/right alignment of voters by their vote in the 2019 general election.
The data also covers voter demographics, as well as each respondent's best recollection of their previous voting behaviour. Altogether, the BES panel is an incredibly rich dataset for exploring voter behaviour.
K-means Clustering
For the cluster modelling, we'll be using the k-means clustering algorithm. In layman's terms, the algorithm works by placing k points, called "centroids", throughout the data, where 'k' is some number that we have to choose before fitting the algorithm. Each data point is assigned to the cluster of its nearest centroid. How much of the data is explained by those centroids is then calculated, the centroids move, and the process starts again. This continues until the amount of the data explained by the clusters is optimised. But how do we decide on how many centroids to start with?
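As a rough sketch of what fitting k-means looks like in practice, here's the scikit-learn version on synthetic data standing in for the scaled BES survey features (the variable names and the synthetic data are illustrative assumptions, not the actual analysis):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled survey features: four loose blobs.
# In the real analysis this would be the BES panel's numeric columns.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=centre, scale=0.5, size=(100, 2))
    for centre in [(-2, -2), (2, 2), (-2, 2), (2, -2)]
])

# Fit k-means with k=4: each point is assigned to its nearest centroid,
# centroids move to their cluster's mean, and this repeats until the
# assignments stabilise.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

labels = kmeans.labels_               # cluster assignment per respondent
centroids = kmeans.cluster_centers_   # final centroid positions
```

Note that k-means is sensitive to feature scale, so in a real pipeline the survey variables would be standardised first.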
To determine how many centroids we should use, the value of 'k', we will use an elbow plot. This plots how much of the data can be explained by an algorithm with k clusters. Obviously more clusters would increase the amount of the data the algorithm can explain. However, the marginal impact of adding another cluster reduces as we add more clusters. The optimal number of clusters is the value of k where there is a sudden change in the line, like an elbow.
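The quantity plotted is typically the total within-cluster sum of squares (scikit-learn calls this `inertia_`), computed for a range of candidate k values. A minimal sketch, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # stand-in for the scaled survey features

# Total within-cluster sum of squares for each candidate k.
# Plotting range(1, 11) against these values gives the elbow plot.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 11)
]
```

The elbow is wherever the curve's slope flattens sharply; as noted above, it can be subtle, so some judgement is involved.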
In the plot above the change is relatively subtle, partly because the plot is a little stretched. We'll be using 4 clusters for the algorithm we build.
Clustering Voters
With our value of k determined, we can fit the algorithm. Visualising the clusters is a little difficult given the number of variables in our data. However, what we can do is use another algorithm called Principal Components Analysis (PCA). This reduces the number of columns by turning combinations of columns into "components", ranked by how much of the variance in the data they explain. This lets us visualise the clusters we've identified a little more easily. Below I've plotted the voter clusters against the largest two of these components, with the percentage of the data they explain in brackets.
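The projection step looks something like the following sketch, where the synthetic matrix stands in for the clustered survey features and `explained_variance_ratio_` supplies the percentages quoted in the plot's axis labels:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))  # stand-in for the clustered survey features

# Project onto the two components that capture the most variance.
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# Percentage of variance explained by each component, for the axis labels.
pct_explained = pca.explained_variance_ratio_ * 100
```

Each row of `X_2d` can then be scattered and coloured by its k-means cluster label.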
I admit, it's a little difficult to see the difference between the clusters with only two of these components. So I've put together a 3D representation of the clusters with a third component added.
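Adding the third component is just a matter of asking PCA for three components and using matplotlib's 3D projection. A hedged sketch, with synthetic features and made-up cluster labels standing in for the real ones:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; assumption for a script context
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))              # stand-in survey features
labels = rng.integers(0, 4, size=150)      # stand-in k-means cluster labels

# Project onto the first three principal components.
coords = PCA(n_components=3).fit_transform(X)

# Scatter the clusters in 3D, coloured by cluster label.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
```

An interactive version (e.g. via plotly) makes the separation easier to see, since you can rotate the view.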